Journal of Clinical Epidemiology
Elsevier BV
Preprints posted in the last 30 days, ranked by how well they match Journal of Clinical Epidemiology's content profile, based on 28 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.
Das, P.; Schneider, J.; Mayo-Wilson, E.; Kilicoglu, H.; Menke, J. D.; Nam, D.; Ninan, K.; Oberste, J.-P.; Troy, A. M.; Ying, X.; Holt, A. W.; Smalheiser, N. R.
Objectives: Study design indexing of biomedical publications is crucial for evidence retrieval and synthesis. We sought to evaluate the accuracy and suitability of a transformer-based model (TM) for indexing clinical study designs, in comparison to National Library of Medicine (NLM) indexing. However, this is challenging for at least three reasons: First, to date, all automated systems have been trained and evaluated on manual NLM indexing assignments, which are themselves subject to errors. Second, TM's probabilistic predictive scores take into account uncertainty, and can be converted to TRUE/FALSE assignments in different ways depending on the needs of users, while NLM labels are categorical. Third, our goal (to tag only articles that exhibit a given design) differs from that of NLM, which tags articles that discuss as well as exhibit that design. Materials and Methods: Therefore, we carried out a limited evaluation of the TM model that focuses only on the articles that received the most confident predictions, that is, the highest scores that are almost certainly TRUE and the lowest scores that are almost certainly FALSE, but which disagreed with NLM assignments. This was performed both for articles published in 2016 (when NLM decisions were manual) and in 2025 (when NLM decisions were automated). To establish ground truth, dual annotators indexed the articles independently, following written definitions, for four prominent study designs--cohort, case-control, cross-sectional, and case report. Results: For three designs (case-control, case report, cross-sectional), the articles having the top 100 predictive TM scores (when NLM failed to assign that design) were judged to exhibit that design in the great majority (86-100%) of cases. Conversely, the articles having the lowest 100 predictive TM scores (when NLM did assign the study design) exhibited the design in relatively few cases (0-21%).
The most confident predictions of the TM model were highly accurate and not redundant with automated NLM indexing; the exception was cohort study articles, in which both TM and NLM labels showed high error rates of both omission and commission. Discussion and Conclusion: TM may have value for identifying articles exhibiting study designs, which is especially important for clinical decision-making as well as systematic reviews and other evidence syntheses. NLM indexing of cohort studies cannot be regarded as a reliable gold standard for training or evaluation of automated systems, warranting efforts to create a new manually annotated corpus.
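The sampling step described in this abstract (taking the most confident model predictions that disagree with NLM) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the record layout `(article_id, tm_score, nlm_assigned)`, the function name, and the toy data are all assumptions made for the example.

```python
def confident_disagreements(records, k=100):
    """Pick the k most confident model predictions that disagree with NLM.

    records: iterable of (article_id, tm_score, nlm_assigned) tuples, where
    tm_score is a probability-like score and nlm_assigned is a bool.
    The tuple layout is hypothetical and chosen only for this sketch.
    """
    records = list(records)
    # High model score, but NLM did NOT assign the design (possible NLM omission).
    nlm_missing = [r for r in records if not r[2]]
    top = sorted(nlm_missing, key=lambda r: r[1], reverse=True)[:k]
    # Low model score, but NLM DID assign the design (possible NLM commission).
    nlm_present = [r for r in records if r[2]]
    bottom = sorted(nlm_present, key=lambda r: r[1])[:k]
    return top, bottom


# Toy data, invented purely for illustration.
toy = [
    ("a1", 0.99, False),
    ("a2", 0.95, True),
    ("a3", 0.02, True),
    ("a4", 0.10, False),
    ("a5", 0.80, False),
    ("a6", 0.30, True),
]
top, bottom = confident_disagreements(toy, k=2)
print([r[0] for r in top])     # ids sent to annotators as likely NLM omissions
print([r[0] for r in bottom])  # ids sent to annotators as likely NLM commissions
```

The two returned lists would then go to independent annotators; the sketch deliberately leaves out any thresholding of scores into TRUE/FALSE labels, since the abstract notes that this conversion depends on user needs.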
Fagerberg, P.; Sallander, O.; Vikhe Patil, K.; Thunborg, C.; Lundstrom, L.; Berg, A.; Nyman, A.; Borg, N.; Linden, T.
Title and abstract screening limits the timeliness of systematic reviews used for clinical guidelines. We evaluated audited large language model (LLM) triage at Sweden's National Board of Health and Welfare. Ten LLMs from five model families were tested on 419 Cochrane reviews comprising 26,892 records, and the selected ensemble was externally validated on 133 reviews including 8,501 records matched to planned guideline topics. The same locked model pair was then used prospectively across 24 systematic reviews in two national guideline programmes. On the 419-review selection benchmark, the selected Gemini-3-flash plus GPT-5.1 ensemble achieved 98.0% (95% CI, 97.3-98.7) mean review-level sensitivity, while topic-matched validation yielded 96.7% sensitivity (95% CI, 93.7-98.9). Prospective deployment screened 74,679 records, placed 63,858 (85.5%) in the AI-excluded pool and reduced estimated first-pass screening effort from 415 to 34 person-days. Across 600 randomly sampled AI-excluded records from the migraine and dementia programmes, none was confirmed as a final false negative after post-unblinding adjudication; across the completed 680-record audit, all 38 final retained records had been AI flagged, whereas locked blinded human consensus missed seven. These findings support locked, audited LLM triage, with human oversight and programme-specific monitoring, for systematic reviews used in national guidelines.
Fazeli, M. S.; Kasireddy, E.; Pourrahmat, M.-M.; Chow, C.; Collet, J. P.
Background: Systematic literature reviews (SLRs) are essential in medical research, but are often time-consuming and costly, necessitating more efficient methods while maintaining accuracy. Objective: This study assessed the performance of a GPT-4o mini large language model (LLM) in automating the first phase of study selection based on titles and abstracts in systematic reviews. Specifically, we evaluated whether the model improved efficiency without compromising on quality. Methods: Structured prompts were created for a GPT-4o mini LLM to facilitate title and abstract screening. The model's performance was evaluated against expert human reviewers across five systematic reviews on inclusion rates, sensitivity, specificity, accuracy, positive predictive value, and negative predictive value. Results: The model screened a total of 15,605 records. It included a higher percentage of studies than human screeners, with 3.5% (n=549/15,605) true positives and 14.2% (n=2,218/15,605) false positives. The model achieved an overall accuracy of 85.1%, with a sensitivity of 83.2% and specificity of 85.2%. The positive predictive value was 19.8%, while the negative predictive value was 99.1%. The model was able to screen 1,000 titles and abstracts in 40 minutes, compared to 16 hours required by a human reviewer. Conclusion: This study demonstrated a strong performance and efficiency in the automation of title and abstract screening in SLRs using an advanced LLM. Further refinements could optimize the balance between sensitivity and specificity, supporting broader implementation in evidence synthesis. A hybrid AI-human approach is recommended to ensure accuracy, reduce reviewer burden, and maintain the methodological rigor required for high-quality SLRs.
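The reported screening metrics can be checked against a 2×2 confusion matrix. In the sketch below, the true-positive and false-positive counts and the total come from the abstract; the false-negative count (111) is an assumption, back-calculated from the reported 83.2% sensitivity, and the true-negative count follows from the total. The function name is illustrative.

```python
def screening_metrics(tp, fp, fn, tn):
    """Standard diagnostic-accuracy metrics from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),  # share of relevant records the model kept
        "specificity": tn / (tn + fp),  # share of irrelevant records it excluded
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


TOTAL = 15605
TP, FP = 549, 2218           # reported in the abstract
FN = 111                     # ASSUMPTION: inferred from the reported sensitivity
TN = TOTAL - TP - FP - FN    # remaining records

metrics = screening_metrics(TP, FP, FN, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under this assumed split, all five metrics round to the values given in the abstract, which also shows why the positive predictive value is low: relevant records are rare, so even a modest false-positive rate outnumbers the true positives.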
Panagiotopoulos, A.-P.; Laskaris, A.; Tsakri, D.; Manoussopoulos, Y.; Anastassopoulou, C.; Tsakris, A.; Ioannidis, J.
Objectives To quantify the frequency of baseline control-group use in published long COVID prevalence studies and assess their key methodological features. Design Cross-sectional meta-epidemiological evaluation of published post-acute COVID-19 prevalence studies, supplemented by a corresponding-author survey. Setting Published studies identified through a systematic review by Hou et al. (2025) and supplementary data obtained through direct email contact with corresponding authors. Participants A total of 440 published long COVID prevalence studies. Main outcome measures Presence and type of comparator group, reliance on solely self-reported outcomes, acknowledgment of lack of a control group among uncontrolled studies, and availability of additional comparator data through author survey. Results Among 440 studies, 372 (84.5%) reported no control group in their publication. Healthy or uninfected comparators were reported in 55 studies (12.5%) and other comparator types in 14 (3.2%); 1 study included both categories. Solely self-reported outcomes were used in 279 studies (63.4%). Among 372 uncontrolled studies, 244 (65.6%) did not explicitly acknowledge the absence of a baseline comparator as a limitation anywhere in the text. Corresponding authors of 140 studies (31.8%) responded to the survey; among them, 126 (90.0%) reported no additional comparative data, while 14 (10.0%) mentioned some available comparative datasets (19 additional datasets). Almost all of that information (10/14, 17/19) had already been published in other articles not captured by the Hou et al. systematic review. Conclusions Most published long COVID prevalence studies lacked comparator groups and relied exclusively on self-reported outcomes without acknowledging this limitation. Direct author contact identified little additional comparator information. 
Much of the long COVID prevalence literature may therefore be poorly suited to estimating burden attributable specifically to SARS-CoV-2, underscoring the need for appropriately matched comparators and more objective outcome assessment. Registration The protocol was prospectively registered on the Open Science Framework (https://osf.io/f4hra).
Matos Porto, A. P.; Gomes, M. S.; de Oliveira, V. F.; Mwanja, H.; Zhu, N.; Holmes, A.; Levin, A. S.; Costa, S. F.
Background: Digital antimicrobial stewardship (AMS) interventions, such as clinical decision support systems, audit and feedback platforms, and electronic prescribing tools, have been increasingly adopted to improve antibiotic use. However, the effectiveness of these interventions across healthcare settings remains uncertain, and the certainty of the evidence has not been comprehensively evaluated. The objective of this study was to provide a comprehensive understanding of the role of digital interventions in optimizing antimicrobial use and improving clinical outcomes within a broad spectrum of healthcare settings. Methods: We conducted a systematic review and meta-analysis of randomized controlled trials evaluating digital AMS interventions, following PRISMA 2020 guidelines. The review was registered in PROSPERO (CRD420251178854) and funded by the Wellcome Trust CAMO Net programme. Searches were performed across major databases. Primary outcomes included the appropriateness of antibiotic prescriptions and the antibiotic prescription rate. Secondary outcomes included 30-day mortality, 30-day hospital readmission, and length of hospital stay (LOS). Random-effects models were used to pool effect sizes. Risk of bias was assessed with RoB 2, and certainty of evidence was rated using GRADE. A Summary of Findings table was prepared to present effect estimates, sample sizes, and evidence certainty. Results: Eleven RCTs met the inclusion criteria, and nine were included in the quantitative synthesis. Digital AMS interventions did not show a significant effect on appropriateness of antibiotic prescribing (RR 0.99, 95%CI 0.93 to 1.05; very low certainty). There was no reduction in antibiotic prescription (RR 0.98, 95%CI 0.88 to 1.09), with substantial statistical heterogeneity and very low certainty. 
Across clinical outcomes, digital AMS showed no effect on 30-day mortality (RR 0.91, 95%CI 0.77 to 1.09; very low certainty) or 30-day readmission (RR 0.95, 95%CI 0.79 to 1.14; very low certainty). For LOS, results were inconsistent across studies, and the pooled effect showed no clinically meaningful change (MD 0.17 days, 95%CI 0.01 to 0.35; very low certainty). Most trials had some concerns of bias due to deviations from intended interventions. Conclusion: Meta-analyses of digital AMS RCTs found no high-certainty evidence of an effect on antibiotic prescribing or clinical outcomes, owing to high heterogeneity in interventions and study designs as well as limitations of the RCTs (no adoption/fidelity metrics).
Irlmeier, R.; Jin, Z.; Ye, F.
Background Simon two-stage designs for binary endpoints and their time-to-event analogues, including the Kwak and Jung method, rely on a fixed null benchmark. Their Type I error control is valid only when that benchmark is correctly specified. In practice, historical benchmarks are often inconsistent due to small samples, population heterogeneity, changing eligibility criteria, and evolving standards of care. Even modest misspecifications can substantially inflate the Type I error rate, leading to costly advancement of ineffective treatments. Methods We propose the Interval-Null Robust (INR) two-stage design framework that accounts for uncertainty in the historical null benchmark. We define the null hypothesis as a plausible range of clinically uninteresting values: $p \in [p_{0L}, p_{0U}]$ for binary endpoints and $\lambda \in [\lambda_{0L}, \lambda_{0U}]$ (or equivalent survival probabilities) for time-to-event endpoints. Type I error is controlled uniformly over the full null interval: $\sup_{\theta \in \Theta_0} \Pr_{\theta}(\mathrm{Go}) \le \alpha$. Under the monotonicity of the Go probability, the supremum occurs at the least favorable null configuration - $p_{0U}$ and $\lambda_{0L}$ - but the design is not reduced to a point-null formulation. The interval defines the uncertainty set for error control and is used in selecting among feasible designs through robust criteria such as worst-case regret or minimal average expected sample size. Results Across representative planning scenarios for both endpoint types, classic designs calibrated to a single benchmark exhibit substantial Type I error inflation when the true null parameter exceeds the assumed planning value. INR designs maintain the nominal Type I error rate across the full null interval, directly addressing this vulnerability to benchmark misspecification. The robustness-efficiency trade-off can be managed through design constraints and robust optimization criteria while preserving uniform Type I error control. 
Conclusions INR two-stage designs offer a transparent framework for addressing historical control uncertainty in single-arm Phase II trials. By replacing reliance on a fixed benchmark assumption with a more realistic interval of clinically plausible null values, INR design reduces the risk of false-positive Go-decisions caused by benchmark misspecification. INR applies to both binary and time-to-event endpoints and is implemented in the open-source INRDesign R package and accompanying interactive Shiny app.
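To make the interval-null idea concrete, the sketch below computes the Go probability of a binary two-stage design over a range of null response rates and reports the worst case. It is a hedged illustration only: it does not use or reproduce the INRDesign package, and the design parameters (n1 = 10, r1 = 1, n = 29, r = 5) and the null interval [0.10, 0.15] are assumptions chosen for the example, not values from the abstract.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of exactly k responses in n patients."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def go_probability(p, n1, r1, n, r):
    """P(Go) for a two-stage design with true response rate p.

    Stage 1: stop for futility if responses X1 <= r1 among n1 patients.
    Stage 2: declare Go if total responses X1 + X2 > r among n patients.
    """
    n2 = n - n1
    total = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        # Smallest number of stage-2 responses that pushes the total above r.
        x2_min = max(r - x1 + 1, 0)
        tail = sum(binom_pmf(x2, n2, p) for x2 in range(x2_min, n2 + 1))
        total += binom_pmf(x1, n1, p) * tail
    return total


def worst_case_type1(p_lo, p_hi, n1, r1, n, r, grid=101):
    """Grid search for the null value in [p_lo, p_hi] that maximises P(Go)."""
    ps = [p_lo + (p_hi - p_lo) * i / (grid - 1) for i in range(grid)]
    probs = [go_probability(p, n1, r1, n, r) for p in ps]
    worst = max(range(grid), key=probs.__getitem__)
    return ps[worst], probs[worst]


# Illustrative design and null interval (assumptions, see lead-in).
design = dict(n1=10, r1=1, n=29, r=5)
at_point_null = go_probability(0.10, **design)
p_star, at_worst = worst_case_type1(0.10, 0.15, **design)
print(f"P(Go) at p = 0.10: {at_point_null:.4f}")
print(f"worst case over [0.10, 0.15]: p = {p_star:.2f}, P(Go) = {at_worst:.4f}")
```

Because the Go probability increases with the true response rate, the grid search lands on the upper end of the interval, which matches the abstract's statement that the supremum occurs at the least favorable null configuration. A design calibrated only at the lower end would therefore understate the false-positive risk.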
Sah, B. K.; Li, J.; Zhang, M.; Jin, R.; Li, X.; Dong, C.; Chen, E.
Background Gastric cancer management is heterogeneous, and although the treating surgeon leads decisions across the pathway, surgeon-level outcome variation remains poorly quantified. This study assessed surgeon identity as an independent predictor of survival after risk adjustment, introducing the Surgical Assessment and Healthcare (SAH) Index. Methods This single-institution retrospective study (Ruijin Hospital, Shanghai Jiao Tong University; NCT07180966) included 692 patients undergoing curative-intent resection for gastric adenocarcinoma (pStage I, II, III) in 2019 by eight consultant surgeons. Overall survival was modelled by multivariable Cox regression (primary model, 199 events, EPV 16.6; complete-case sensitivity model, N = 647). The SAH Index expressed surgeon × stage observed-to-expected ratios for five-year mortality and major morbidity (Clavien-Dindo ≥ IIIa). Median follow-up was 74.3 months. Results Independent predictors of survival were tumour stage (HR 2.979/step), age (HR 1.030/year), and non-distal gastrectomy (HR 1.498; all p ≤ .006). After full adjustment, surgeon identity remained significant (Wald = 14.58, df = 7, p = .042): two surgeons carried roughly double the reference hazard, S6 (HR 2.219, p = .003) and S8 (HR 2.034, p = .031), both with the cohort's lowest neoadjuvant chemotherapy rates (3.0% and 7.0% versus 17.6%), implicating pre-operative pathway decisions. The effect persisted in the sensitivity model (MSI also prognostic, HR 3.162, p = .007). Morbidity benchmarking flagged no surgeon for excess complications (no Tier 2 flags) and one survival-outlier cell (S6, Stage II; Tier 3). Conclusion Surgeon identity is independently associated with survival in gastric cancer beyond measurable case mix. The SAH Index offers a reproducible tool for institutional and inter-hospital benchmarking, with tier assignments stable across all four prespecified weighting scenarios, confirming that tier classification is independent of weight specification.
Pears, M.; Wadhwa, K.; Payne, S. R.; Konstantinidis, S. T. H.; Biyani, C. S.
Large language models (LLMs) such as ChatGPT are rapidly reshaping healthcare education and simulation-based training in non-technical skills (NTS), yet no bibliometric analysis has mapped this landscape. We searched seven open-access databases (OpenAlex, PubMed, Europe PMC, Crossref, Semantic Scholar, CORE, DOAJ) for English-language publications from January 2020 to March 2026. From 100,277 initial records, a sequential keyword funnel yielded 830 candidate papers, which were screened by 83 independent Claude Sonnet 4.6 AI agents applying pre-specified inclusion criteria (PRISMA-trAIce compliant; Cohen's kappa = 0.86 pre-reconciliation, 1.0 post-reconciliation). The final AI-verified corpus comprised 551 papers with a compound annual growth rate of 109%, contributions from 2,398 authors across 279 journals in 58 countries, and an h-index of 41. ChatGPT dominated the model landscape (46% of papers), with open-source models virtually absent. Virtual patient chatbots were the leading simulation modality (106 papers). Among NTS domains, communication (145 papers) and decision-making (135 papers) were most studied, whereas teamwork, leadership, situational awareness, and crisis resource management were markedly underrepresented. Only 6 urology-relevant papers were identified, none examining LLM integration within boot camp training formats. The field is growing at extraordinary pace but remains concentrated in a narrow range of NTS domains and a single proprietary model. Critical gaps persist in team-based skills training, open-source model evaluation, and specialty-specific simulation. AI-assisted bibliometric screening using multiple independent agents is feasible, reliable, and scalable, offering a replicable methodology for mapping fast-evolving research fields.
Woelfle, T.; Fucile, G.; Hirt, J.; Pena, R. C. G.; Vogt, M.; Nordhausen, T.; Ewald, H.; Appenzeller-Herzog, C.
The systematic review (SR) is a thriving study type in modern medicine and beyond. Many SR authors complement their primary database searches with supplementary techniques. Among these, citation-based techniques known as citation searching (CS) are widespread. Unranked Direct CS (UDCS), which identifies literature directly cited by or citing the seed references, is currently the most prevalent. Ranked (In)direct CS (RICS) additionally collects co-cited and co-citing literature and applies a ranking and cut-off procedure. However, RICS workflows remain non-standardized and tedious, and their benefits remain unclear. This work aims to create a framework for the prospective international comparison of supplementary UDCS and RICS. To prime RICS research, we developed the open-source Co*Citation Network application and assessed parallel supplementary UDCS and RICS retrospectively in three completed SRs and prospectively in one case study. Automated RICS collected and ranked cited, citing, co-cited, and co-citing literature of seed references from the OpenAlex database and applied an empirical rank cut-off to approximate the volume of UDCS results. In RICS compared to UDCS, we consistently noted higher overlap with primary database search results. Title/abstract screening in the case study showed a precision (number needed to read) of 1.8% (57) for UDCS and 2.1% (48) for RICS results. After full-text screening, two additional articles were included for review, one of which was identified by UDCS and RICS, and one exclusively by UDCS. The present study indicates potential benefits of RICS for SR authors and will enable the formation of a research consortium to compare supplementary UDCS and RICS on a larger scale.
Doneva, S. E.; Ellendorff, T. R.; Schneider, G.; Held, L.; von Wyl, V.; Simpson, I.; Sick, B.; Ineichen, B. V.
Background: Large-scale estimates of animal-to-human drug translation and the study characteristics associated with successful translation remain limited. The expanding preclinical literature also challenges manual evidence synthesis. We developed a natural language processing (NLP) pipeline to structure and link preclinical and clinical evidence at scale. Methods: In this retrospective meta-research study, we analysed more than 500,000 neuroscience-related animal drug studies from PubMed and linked them to clinical trial and regulatory approval data. NLP methods extracted drug, disease, and experimental design characteristics from abstracts and full texts. Translation was defined as progression to completed phase III/IV trials or regulatory approval. Logistic regression assessed associations between preclinical study characteristics and successful translation. Findings: Among 291,624 drug entities identified in animal studies, 6·7% entered clinical development and 3·1% reached phase III/IV trials or regulatory approval. At the drug-disease level, 4·4% entered clinical development and 1·9% achieved translation. Restricting analyses to successfully linked ontology entities increased estimates to 11·3% and 4·1%, respectively. Male-only animal studies predominated, whereas reporting of randomisation, blinding, and sample size calculations remained limited. Testing across multiple species and reporting blinding were associated with higher odds of successful translation. Interpretation: Only a minority of interventions tested in animals progress to advanced clinical development or regulatory approval. Greater species diversity and blinding were associated with improved translational success. NLP-based evidence synthesis may support scalable evaluation of translational research and identification of potentially modifiable research practices. 
Funding: Swiss National Science Foundation, UZH Digital Entrepreneurship Fellowship, Universities Federation for Animal Welfare. Research in context. Evidence before this study: We searched the literature for studies quantifying large-scale animal-to-human translation and factors associated with successful translation. Existing work was mainly limited to specific diseases, interventions, or manually curated datasets, and large-scale linkage of animal and clinical evidence remained limited. Added value of this study: We developed a natural language processing pipeline linking more than 500,000 animal studies to clinical trial and regulatory approval data. The study provides large-scale estimates of translation and identifies experimental characteristics associated with successful translation. Implications of all the available evidence: The findings suggest that only a minority of interventions tested in animals progress to advanced clinical development or regulatory approval. Greater species diversity and reporting of blinding were associated with improved translation. Automated evidence synthesis may support more systematic evaluation of translational research practices.
Avenell, A.; Bishop, D.
Background: In 2024, the BMJ updated its data-sharing policy for clinical trials, requiring deidentified individual patient data (IPD) to be openly deposited prior to publication. Our objective was to discover whether data-sharing increased after introduction of the new policy. Method: All data-sharing statements were downloaded from BMJ trials published in 2023 (submitted pre-updated policy) and 2025 (submitted post-updated policy). Data for 2025 were gathered for trials in five comparison medical journals. Data-sharing statements were coded to specify whether IPD were immediately available, and if not, the reason why. Where a statement gave a link to a repository, we checked whether data were available. Results: Openly available IPD for BMJ trials increased from 0/32 prior to the new policy to 19/33 (58%) after the updated policy; seven articles gave repository links that did not yield any data. In the five comparison journals, rates of open IPD varied from 0% to 5.6%. Conclusions: There was a substantial increase in open sharing of IPD after introduction of the new policy compared to a prior period. Open sharing of IPD is possible, but it is unpopular with authors and is unlikely to be achieved without firm editorial enforcement.
Schirle, L.; Babel, M.; Briem, J.-S. J.; Gawehn, N.; Janka, H.; Metzendorf, M.-I.; Trunk, E.; Wohlleben, J.; Weibel, S.; Spiegler, J.
Aim: To systematically evaluate evidence on the effects of post-discharge early developmental intervention programs (EI) on behavioral development, quality of life, participation, executive functioning, parent-child interaction, and use of medical services from infancy through adolescence in children born preterm. Method: Four bibliographic databases and one trial registry were systematically searched for randomized controlled trials up to April 23, 2024. Two reviewers independently screened studies and extracted data. For clinically and methodologically comparable studies, random-effects meta-analyses were performed. Risk of bias was assessed with the Cochrane RoB 2 tool, and certainty of evidence with the GRADE approach. Results: Twenty-six studies met inclusion criteria; eleven studies including 2,315 preterm-born infants reported relevant outcomes, and seven contributed to meta-analyses. Most reported results showed some concerns or high risk of bias; certainty of evidence ranged from very low to moderate across outcomes. EI may offer small benefits for selective attention, behavioral problems and parent-child interaction. Little to no effect was found for special educational needs, language skills, executive functioning and the use of medical services. No included studies evaluated the effect of EI on ADHD, quality of life, or participation related to mobility or leisure activities. Interpretation: EI may improve problems typically seen in preterm children and should be offered especially to those with additional medical or social risk factors. High-quality, contemporary trials are needed to establish reliable clinical recommendations regarding EI strategies and complementary interventions throughout childhood.
Carlisle, B. G.; Hutchinson, N.; Moyer, H.
Background: The global SARS-CoV-2 pandemic disrupted healthcare systems worldwide, raising concerns about its impact on clinical research. Early reports suggested reductions in participant enrollment, interruptions to ongoing trials, and challenges to protocol adherence, yet the magnitude and duration of these operational disruptions remain unclear. Methods: We conducted a registry-based analysis comparing clinical trials during the COVID-19 pandemic (December 2019 to November 2022) with a matched pre-pandemic cohort (December 2016 to November 2019). Studies were included if they reported any modifications to trial status, enrollment, or protocols during the study periods. Key variables included trial stoppage, enrollment changes, and adoption of remote or hybrid procedures. Results: The global SARS-CoV-2 pandemic resulted in widespread disruptions to trial operations with 13,323 clinical trials terminated, suspended or withdrawn over the course of the pandemic, a 38% increase compared to the 9,665 trials that stopped in the 3 years prior to the pandemic. Registries indicated a sharp decline in new participant enrollment across geographic regions and therapeutic areas, with partial recovery in later months. Review findings highlighted barriers including patient inaccessibility, staff redeployment, and supply chain interruptions. Conclusions: The pandemic caused system-wide operational shocks that compromised trial timelines and may have downstream methodological consequences. Recovery in enrollment does not imply restoration of pre-pandemic protocol fidelity or outcome ascertainment. Standardized reporting of disruptions, proactive contingency planning, and resilient trial designs are needed to maintain data integrity during large-scale disruptions and to support reliable evidence generation.
Madison, M.; Wheaton, L. A.; Rowe, V.
Background: Occupational therapists can improve stroke survivors' hand and arm movement and participation in daily activities through action observation (AO). AO involves watching another person's hand or arm complete a movement or task. While research generally supports the use of AO with stroke survivors, few AO videos are available to occupational therapists, which makes applying AO challenging. Objective: The purpose of this work is to develop a structured and widely accessible tool to support access to AO for stroke survivors, occupational therapists, and researchers. Methods: To develop an AO video library for stroke rehabilitation, functional and non-functional upper limb task deficits were first identified through clinical observations and clinician interviews to establish a prioritized list of daily activities. In collaboration with media production specialists, healthy adult volunteers were recruited and filmed performing these tasks from both first- and third-person perspectives. The recorded videos were then systematically edited, enhanced with instructional title slides, and distributed via a public YouTube channel for clinical application and a categorized digital repository for research purposes. Results: Initial assessments revealed a complete lack of familiarity, awareness, and utilization of AO resources among local occupational therapists, despite high perceived clinical utility. To address this gap, a final library of 150 tasks was established, resulting in the production of 419 finalized, standardized videos featuring six healthy volunteers. For clinical application, these videos were hosted on a free, public YouTube channel organized into 18 functional playlists, while a parallel set was structured into distinct movement categories for research repository storage. 
Conclusion: By providing a structured and highly accessible tool, this repository enables clinicians, researchers, and caregivers to readily implement evidence-based action observation interventions in both clinical and home settings.
Bowen, H. P.; O'Loughlin, G.; Schleicher, C.; Schulthess, D.
Background: The impact of the Inflation Reduction Act (IRA) upon late-stage developments has been assumed to be limited. The Congressional Budget Office's IRA analysis excluded post-approval innovation, potentially overlooking substantial economic risks to drug developers and declines in the availability of treatments in areas of high unmet medical need such as oncology. Methods: A total of 1148 secondary trials from 364 FDA-approved medicines, published from 2018 to 2025, were obtained from Biomedtracker and clinicaltrials.gov. Using fractional multinomial logit, we model the share distribution of secondary indication studies across 19 disease groups and assess the change in this distribution post-IRA. We also assessed the number of secondary treatment studies pre- vs. post-IRA using multiple linear regression. Results: After the IRA's introduction, small molecule follow-on studies in oncology exhibited a statistically significant 35% decline (R2 = .48, p < 0.014) and lead indication small molecule oncology approvals exhibited a statistically significant 27% decline (R2 = .70, p < 0.002). We also find a statistically significant 14% decline in the share of orphan oncology studies pre- vs. post-IRA (p<0.001). Research Conclusions: This study's results refute claims that the IRA would have minimal negative effects on patient access or late-stage biopharmaceutical R&D. We hope this study reinvigorates debate about the law's unintended consequences and encourages thoughtful policy solutions, as the IRA manifestly creates disincentives that negatively impact patients seeking needed new medicines, particularly those requiring cures addressing metastatic late-stage cancers.
Acosta-Monterrosa, A. A.; Hernandez-Paez, D. A.; Visconti-Lopez, F. J.; Kalokoh, S.; Lozada-Martinez, I. D.
Background: Quantifying the alignment between scientific production and population-level indicators remains a persistent methodological challenge in health research evaluation. While longitudinal ecological models have been increasingly used to explore associations between research output and societal outcomes, their feasibility, interpretability, and structural limitations have not been systematically examined. Methods: We conducted a longitudinal ecological meta-research analysis integrating global bibliometric data on mental health publications with country-level indicators of mental disorders, mental health infrastructure, and subjective well-being. Analyses were stratified by World Bank income groups and implemented using a three-step framework comprising income-specific linear regression models, random-effects meta-analyses, and meta-regressions to assess association patterns, heterogeneity, and potential moderators. Results: Scientific production was highly concentrated in high-income countries. Income-stratified regression models revealed divergent association patterns across contexts, with inverse associations observed in higher-income groups and predominantly positive coefficients in low-income countries. Meta-analyses showed extreme between-group heterogeneity for most indicators, yielding largely attenuated pooled estimates. Only one subjective well-being indicator retained a significant pooled association. Conclusions: Longitudinal ecological models linking scientific production to population-level indicators can identify broad association patterns and structural asymmetries but are strongly constrained by contextual heterogeneity and data availability.
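The second step of the framework above, pooling income-group regression coefficients with a random-effects meta-analysis, can be sketched as follows. The abstract does not say which between-group variance estimator was used; this sketch assumes the DerSimonian-Laird estimator, and the function name `dersimonian_laird` and its inputs are illustrative rather than taken from the preprint.

```python
import math

def dersimonian_laird(effects, variances):
    """Pool per-group estimates with a DerSimonian-Laird random-effects model.

    effects:   per-group coefficients (hypothetical inputs, e.g. one slope
               per income group)
    variances: their squared standard errors (must be positive)
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two effects with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled mean.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's heterogeneity statistic Q and its degrees of freedom.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1

    # Method-of-moments estimate of between-group variance, truncated at 0.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights add tau^2 to every within-group variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)

    # I^2: share of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {"pooled": pooled, "se": se, "tau2": tau2, "q": q, "i2": i2}
```

When the group estimates disagree strongly, `tau2` grows, the weights flatten, and the standard error of the pooled estimate widens, which is one way extreme heterogeneity can leave a pooled association non-significant.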
Kang, W. J.; Sim, J.; Loh, E. E. M.; Lim, A. C. Y.; Foong, K. W. C.
Importance. Large language models are increasingly explored as clinical decision support tools in orthodontics, yet existing evaluations have been confined to knowledge-based question answering, where reported accuracy ranges from 18% to 100%. No study has evaluated performance on the computational and classificatory tasks that define daily diagnostic work. Furthermore, 84.3% of published healthcare large language model studies fail to report the number of repeated queries performed, leaving output stochasticity unexamined. Objective. To compare the diagnostic accuracy and output consistency of three frontier reasoning-enhanced large language models, namely ChatGPT 5.4 (Thinking), Gemini 3 (Thinking), and Claude Opus 4.6 (Extended Thinking), on Bolton analysis, Index of Orthodontic Treatment Need-Dental Health Component (IOTN DHC) classification, space analysis, and lateral cephalometric interpretation. Methods. In this comparative cross-sectional study with a repeated-measures design, each model, accessed through its consumer-facing web interface under default provider settings rather than through an application programming interface, processed 200 purpose-built items (50 per task) across four independent trials, yielding 2,400 observations. Responses were scored against a pre-established reference standard by two independent raters using strict binary exact-match criteria. Accuracy was reported with exact binomial 95% confidence intervals. Inter-model comparisons used Cochran's Q test with post-hoc McNemar's tests and Bonferroni correction. A supplementary context-rich prompting evaluation was conducted on 40 items (480 observations). Results. Claude Opus 4.6 (Extended Thinking) achieved the highest accuracy (99.0%; 95% CI: 96.4 to 99.9%), followed by Gemini 3 (Thinking) (95.5%; 91.6 to 98.1%) and ChatGPT 5.4 (Thinking) (94.0%; 89.8 to 96.9%) (Cochran's Q=6.87, p=0.032). Each model exhibited distinct, non-overlapping error profiles concentrated at the normal-abnormal classification boundary. An accuracy-consistency paradox emerged: the most accurate model was the least consistent (93.0%), while the least accurate was the second-most consistent (98.0%). Context-rich prompting eliminated all errors across all three models. Interpretation. Frontier reasoning large language models achieved high overall accuracy on orthodontic diagnostic tasks but retained concealed, task-specific vulnerabilities detectable only through repeated-trial evaluation. An accuracy-consistency paradox, in which the most accurate model was the least consistent, demonstrates that single-trial evaluations cannot characterise clinical risk. The reasoning modes were associated with high arithmetic accuracy but did not compensate for imprecise parametric knowledge on classification tasks; however, the absence of a non-thinking baseline means this association cannot be attributed to the thinking mode itself. Context-rich prompting eliminated all errors on synthetic data but should be regarded as a necessary yet insufficient prerequisite for clinical deployment pending prospective validation on real patient data.
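The two statistical tools named in the Methods, exact binomial confidence intervals and Cochran's Q, can be sketched in plain Python. This is an illustration only: the function names are hypothetical, the count of 198 correct out of 200 in the usage note is an assumed example (the abstract reports percentages, not raw counts), and the closed-form p-value applies only when three models are compared, because the chi-square reference distribution then has 2 degrees of freedom.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided CI for a proportion, by bisection."""
    def solve(f, target, increasing):
        lo, hi = 0.0, 1.0
        for _ in range(60):  # 60 halvings is well beyond float precision needs
            mid = (lo + hi) / 2
            if (f(mid) < target) == increasing:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha/2, which increases with p.
    lower = 0.0 if k == 0 else solve(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, True)
    # Upper bound: P(X <= k | p) = alpha/2, which decreases with p.
    upper = 1.0 if k == n else solve(
        lambda p: binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper

def cochran_q(rows):
    """Cochran's Q for binary outcomes; rows are items, columns are models."""
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    total = sum(row_totals)
    denom = k * total - sum(r * r for r in row_totals)
    if denom == 0:
        raise ValueError("all items were answered identically by every model")
    return (k - 1) * (k * sum(c * c for c in col_totals) - total**2) / denom

def chi2_sf_df2(q):
    """Upper-tail chi-square probability for exactly 2 degrees of freedom."""
    return math.exp(-q / 2)
```

Calling `clopper_pearson(198, 200)` gives an interval for the assumed count, and `chi2_sf_df2(6.87)` evaluates to roughly 0.032, which agrees with the p-value reported for Cochran's Q across the three models.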
Kelly, R. E.
Null Hypothesis Significance Testing (NHST) remains the dominant paradigm for evaluating empirical research findings in medicine and the social sciences despite concerns about frequent misinterpretations of those findings. Achieving "statistical significance," the goal of NHST, often invites unrealistic conclusions. It would help to add a broader Bayesian perspective that treats research as the progressive readjustment of hypothesis credibility in light of all sources of evidence. For this purpose, the Hypothesis Race Model (HRM) provides an intuitive Bayesian approach that builds upon NHST concepts, helping to correct misunderstandings with minimal reeducation. The HRM is an extension of the Bayesian approach by Ioannidis in 2005 that helped to explain "why most published research findings are false." It is powerful enough to serve as the foundation for mathematical models to estimate and reduce the cost of empirical hypothesis testing.
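For orientation, the 2005 Ioannidis calculation that the HRM is said to extend can be written in a few lines. This sketch shows only that starting point, the positive predictive value of a significant finding given pre-study odds R, significance level α, and power 1 − β, without the bias and multiple-team terms from the same paper. It is not the HRM itself, whose details the abstract does not give, and the function name `ppv` is illustrative.

```python
def ppv(prior_odds, alpha=0.05, power=0.8):
    """Positive predictive value of a statistically significant result.

    Implements PPV = (1 - beta) * R / (R - beta * R + alpha), where R is the
    pre-study odds that the tested relationship is true and power = 1 - beta.
    """
    if prior_odds < 0 or not (0 < alpha < 1) or not (0 < power <= 1):
        raise ValueError("prior_odds must be >= 0; alpha and power must be probabilities")
    true_positives = power * prior_odds   # proportional to true findings detected
    false_positives = alpha               # proportional to null findings flagged
    return true_positives / (true_positives + false_positives)
```

With even pre-study odds and 80% power, a significant result is very likely true, whereas at odds of 1 to 10 and 20% power, `ppv(0.1, power=0.2)` drops below one half. That is the sense in which a significant p-value alone says little about the credibility of a hypothesis.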
Kleper, S. L.; Melamed, R. D.
Machine learning models for causal inference aim to adjust for confounding factors that are associated with both an exposure and an outcome and that create a spurious, biased association. However, these methods are rarely evaluated empirically to assess their success in mitigating such bias. Recent advances in knowledge representation, including both foundation models and knowledge graphs, could enrich these models, but rigorous evaluations are needed to assess their potential. Here, we ask whether enriching existing causal inference models with knowledge representations from foundation models can improve confounding control. Rather than using semi-simulated data to address this question, we focus on examples of real confounding: we emulate target randomized active comparator trials that are subject to confounding by indication. Our results can guide researchers aiming to develop or apply methods for discovering causal effects from observational data.
Arshad, A.; Carey, K. A.; Daniels, L. A.; Jani, P.; Gilbert, E.; Sanchez-Pinto, L. N.; Mayampurath, A.
Objective: Readmissions to the pediatric intensive care unit (PICU) are associated with increased morbidity and mortality. A prediction model that can identify children at risk of readmission at the time of transfer can allow providers to intervene and potentially improve patient outcomes. The objective of this study was to derive and validate machine learning models to predict PICU readmission at the time of transfer. Design: Retrospective observational cohort study. Setting: Three quaternary care PICUs in the city of Chicago. Patients: All children admitted to the PICU between 2012 and 2019. Measurements: The primary outcome was unplanned readmission to the PICU within 48 hours of transfer to the inpatient ward. Predictor variables included vital signs, patient characteristics, and laboratory results. We developed and externally validated four models to predict PICU readmission: logistic regression, elastic net, random forest, and XGBoost. Main Results: This study included 35,601 patients, with readmission rates ranging from 2.2% to 3.7% by site. The performance of models during internal validation was consistent at the three sites, with area under the receiver operating characteristic curve (AUC) values between 0.70 and 0.73 and no difference across the four models. Model performance decreased significantly during external validation (AUCs of 0.60-0.69). The variables most important to the prediction differed at each site. Conclusion: Machine learning models for predicting readmissions to the PICU have limited generalizability. Locally derived models demonstrated modest performance in our study and could potentially inform provider decision-making if prospectively validated. Externally developed models are unlikely to perform well at predicting PICU readmissions.
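The AUC values quoted above have a direct interpretation: the probability that a randomly chosen readmitted patient receives a higher predicted risk than a randomly chosen patient who was not readmitted. A minimal sketch of that pairwise definition follows; the function name `auc` and the toy labels and scores in the usage note are hypothetical and are not data from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of risk scores.

    labels: 1 for the event (e.g. readmission), 0 otherwise
    scores: predicted risks, higher meaning more likely to have the event
    Ties between a positive and a negative count as half a win.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For the toy input `auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])`, three of the four positive-negative pairs are ranked correctly, giving 0.75. Applying the same function to predictions from a model trained at one site and scored at another is how a drop from internal to external validation would show up.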